Debellor: Open Source Modular Platform for Scalable Data Mining
Abstract
This paper introduces Debellor (www.debellor.org) – an open source extensible data mining platform with stream-oriented architecture, where all data transfers between elementary algorithms take the form of a stream of samples. Data streaming enables implementation of scalable algorithms, which can efficiently process volumes of data that exceed available memory. This is very important for data mining research and applications, since the most challenging data mining tasks involve voluminous data, either produced by a data source or generated at some intermediate stage of a complex data processing network. Advantages of data streaming are illustrated by experiments with clustering time series. The experimental results show that even for moderate-size data sets streaming is indispensable for successful execution of algorithms; without it, the algorithms run hundreds of times slower or simply crash due to memory shortage. Stream architecture is particularly useful in application domains such as time series analysis, image recognition and mining data streams. It is also the only efficient architecture for implementation of online algorithms. Due to its scalability and modularity, Debellor was chosen as the basis for the TunedTester application – one of three pillars of the TunedIT (tunedit.org) system for automatic evaluation of machine learning algorithms. The current version of Debellor is 0.6.2.
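The idea of passing data between elementary algorithms as a stream of samples can be illustrated with a minimal sketch. It is written in Python with generators and does not use or reproduce Debellor's actual API; the names `source`, `difference` and `running_mean` are hypothetical and chosen only for illustration. Each stage pulls one sample at a time from the previous stage, so memory use stays constant regardless of how many samples flow through the chain.

```python
from typing import Iterable, Iterator, List

Sample = List[float]


def source(n_samples: int, length: int) -> Iterator[Sample]:
    """Hypothetical data source: yields one synthetic time series at a time."""
    for i in range(n_samples):
        yield [float(i + t) for t in range(length)]


def difference(stream: Iterable[Sample]) -> Iterator[Sample]:
    """Intermediate stage: first differences of each series, sample by sample."""
    for s in stream:
        yield [b - a for a, b in zip(s, s[1:])]


def running_mean(stream: Iterable[Sample]) -> Sample:
    """Online consumer: keeps only the running mean vector, never the full data."""
    mean: Sample = []
    count = 0
    for s in stream:
        count += 1
        if not mean:
            mean = [0.0] * len(s)
        # Incremental mean update: one pass, constant memory.
        mean = [m + (x - m) / count for m, x in zip(mean, s)]
    return mean


# Stages are chained; no stage ever materializes the whole data set.
result = running_mean(difference(source(n_samples=1000, length=5)))
```

Because the consumer updates its state incrementally, the same chain works unchanged if `n_samples` grows far beyond what would fit in memory, which is the property the abstract attributes to stream architecture and online algorithms.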
Similar resources
Debellor: A Data Mining Platform with Stream Architecture
This paper introduces Debellor (www.debellor.org) – an open source extensible data mining platform with stream-based architecture, where all data transfers between elementary algorithms take the form of a stream of samples. Data streaming enables implementation of scalable algorithms, which can efficiently process large volumes of data, exceeding available memory. This is very important for dat...
SAMOA: scalable advanced massive online analysis
SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams. It provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It features a pluggable architecture that allows it to run on several...
VIKAMINE - Open-Source Subgroup Discovery, Pattern Mining, and Analytics
This paper presents an overview of the VIKAMINE system for subgroup discovery, pattern mining and analytics. As of VIKAMINE version 2, it is implemented as a rich-client platform (RCP) application, based on the Eclipse framework. This provides a highly configurable environment and allows modular extensions using plugins. We present the system, briefly discuss exemplary plugins, and provide a...
Frequent Pattern Mining Algorithm on MapReduce Model: A Review
Frequent Itemset Mining (FIM) is one of the best-known methods for extracting knowledge from data. FIM methods become more problematic when applied to huge data sets such as Big Data. There are many algorithms available for Frequent Pattern Mining. Existing algorithms have many problems when mining huge data sets, such as being expensive because of frequent disk I/O scans, ...
Comparing Data Processing Frameworks for Scalable Clustering
Recent advances in the development of data parallel platforms have provided a significant thrust to the development of scalable data mining algorithms for analyzing massive data sets that are now commonplace. Scalable clustering is a common data mining task that finds consequential applications in astronomy, biology, social network analysis and commercial domains. The variety of platforms avail...